Extending k-means with the description-comes-first approach
Abstract
This paper describes a technique for clustering large collections of short and medium-length text documents such as press articles and news stories. The technique, called description comes first (DCF), consists of three steps: identifying related document clusters, selecting salient phrases relevant to these clusters, and reallocating documents that match the selected phrases to form the final document groups. Its advantages include more comprehensive cluster labels and a clearer (more transparent) relationship between cluster labels and their content. We demonstrate DCF by taking the standard k-means algorithm as a baseline and weaving DCF elements into it; the outcome is the descriptive k-means (DKM) algorithm. The paper covers the technical background needed to implement DKM efficiently and ends with an experiment measuring clustering quality on the 20-newsgroups benchmark document collection. Short fragments of this paper appeared at the poster session of the RIAO 2007 conference, Pittsburgh, PA, USA (electronic proceedings only).
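The three DCF steps named in the abstract can be illustrated with a toy sketch. This is not the authors' DKM implementation: the bag-of-words vectors, the cosine similarity, the use of frequent bigrams as candidate phrases, the fixed seed documents, and every function name here (`kmeans`, `candidate_phrases`, `describe_and_reallocate`) are assumptions made only for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a, b):
    # Cosine similarity between two sparse term-weight mappings.
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def kmeans(vectors, seeds, iterations=10):
    # Step 1: plain k-means, seeded with fixed documents for determinism.
    centroids = [dict(vectors[i]) for i in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)),
                       key=lambda c: cosine(v, centroids[c]))
            groups[best].append(v)
        centroids = [centroid(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def candidate_phrases(docs_tokens, min_df=2):
    # Assumption: a "salient phrase" candidate is a bigram that occurs
    # in at least min_df documents.
    df = Counter()
    for toks in docs_tokens:
        df.update(set(zip(toks, toks[1:])))
    return sorted(p for p, n in df.items() if n >= min_df)

def describe_and_reallocate(docs, seeds):
    tokens = [tokenize(d) for d in docs]
    centroids = kmeans([Counter(t) for t in tokens], seeds)
    phrases = candidate_phrases(tokens)
    groups = {}
    for c in centroids:
        # Step 2: label each cluster with the phrase closest to its centroid.
        label = max(phrases, key=lambda p: cosine(Counter(p), c))
        # Step 3: the final group holds only documents containing the label.
        groups[" ".join(label)] = [
            i for i, t in enumerate(tokens) if label in set(zip(t, t[1:]))
        ]
    return groups

docs = [
    "stock market falls sharply",
    "stock market rallies today",
    "investors watch stock market",
    "football team wins final",
    "football team loses final",
    "fans cheer football team",
    "market prices rise",
]
groups = describe_and_reallocate(docs, seeds=(0, 3))
print(groups)
# {'stock market': [0, 1, 2], 'football team': [3, 4, 5]}
```

In this toy run, k-means places the last document ("market prices rise") in the first cluster, but it does not contain the label "stock market", so it is left out of the final group. That is one way to read the abstract's claim of a more transparent relationship between labels and content: every document in a final group actually matches its label. The sketch assumes at least one candidate phrase exists and that clusters pick distinct labels.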